Module 2: Why Visualize Data?#
Visualizations help us understand data quicker and in a more effective way, resulting in the ability to extract insights and deeper understanding of the data we’re working with.
If we were just given plain data, it may be difficult to extract patterns from it. However, by transforming this data into visual formats such as charts, graphs, and maps, we can more easily identify trends, correlations, and outliers that might not be apparent in raw data.
For example, time series data can be better understood through line graphs, which can reveal trends over time. Similarly, scatter plots can highlight relationships between variables, and heat maps can provide a clear view of data density or intensity. Visualizations also aid in communicating complex data findings to a broader audience, including those who may not have a technical background.
Interactive visualizations allow users to explore data dynamically, offering the ability to drill down into details and adjust parameters to see how outcomes change. This interactive element enhances the depth of analysis and supports better decision-making.
Without visualizations, the conclusions we reach my be vague and equivocal. For example, summary statistics are indeed helpful, but ultimately are only a partial piece of a bigger story. Moreover, they can be misleading if we haven’t actually plotted out the data yet.
Through the power of visualizations, we can immediately identify patterns within data, which we otherwise wouldn’t be able to from merely looking at raw data alone, or summary statistics too for that matter.
Clinical Questions#
Within this module, we aim to reason about the following questions:
Does whether a not a patient used RHC usage affect their length of stay in the hospital?
Does sex play a role in determining if a patient undergoes RHC usage, and are there any discrepancies between the sex of the patient and their length of stay in the hosptial?
Is there any association between the age of the patient and their length of stay in the hospital?
Imperative vs Declarative Visualizations#
When it comes to data visualization, there are 2 main approaches:
Imperative
Declarative
Imperative (low level)#
Imperative visualizations center around the exact construction details of plots, emphasizing the mechanics behind their creation. This approach relies heavily on the use of loops and low-level drawing commands, allowing for precise control over every aspect of the plot’s appearance.
While this method provides detailed customization and pinpoint control, it can become cumbersome and time-consuming when attempting to create more complex visualizations. Despite the possibility for detailed fine-tuning, this approach may not always be the most efficient choice for data representations.
Declarative (high level)#
On the other hand, declarative visualizations prioritize the discovery of significant relationships within data, rather than the mechanics of plot construction.
This approach involves linking columns of data to visual channels, such as axes, colors, or shapes, to effectively convey the underlying patterns and insights.
By focusing on the essence of the data, declarative visualizations provide the core information needed, without the hassle of having to manage minor plotting details. This method streamlines the creation process, allowing users to concentrate on interpreting the data rather than on the specifics of plot construction.
What is EDA (Explatory Data Analysis)?#
Exploratory Data Analysis (EDA) is a crucial step in the data analysis workflow where analysts use statistical graphs and other data visualization tools to summarize the main characteristics of a dataset. Simply put, EDA is the process of analyzing and visualizing data to uncover patterns, spot anomalies, and gain insights before actually any models are created.
It involves investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
EDA helps in understanding the underlying structure of the dataset in question, and guides subsequent data processing and modeling. By revealing insights and relationships that may not be immediately obvious, EDA plays an important role in making decisions that are primarily data-driven.
Visualization Grammar#
In order to create informative and effective plots, we must incorporate a high-level grammar of visualizations.
Visualization grammar is a systematic approach to building visual representations of data. By breaking down the process into fundamental components, it allows for more flexible, reusable, and understandable visualizations. This concept is similar to how grammar in a language allows for the construction of meaningful sentences from a set of rules.
Visualization grammar is composed of 3 different grammatical components.
Creating a chart linked to a dataframe. The foundation of any visualization is a data structure, typically a dataframe, that contains the data to be visualized. The first step is to establish this link, ensuring the data is correctly loaded and accessible for plotting.
Adding graphical elements (points, lines, etc). Graphical elements, or marks, are the basic visual components used in a chart, such as points, lines, bars, and areas. These elements represent the data visually and can be combined in various ways to form different types of charts (e.g., scatter plots, line charts, bar charts).
Encoding dataframe columns as visual channels. Visual channels refer to the attributes of graphical elements that can be manipulated to represent data, such as position, size, shape, color, and orientation. Encoding involves mapping the columns of a dataframe to these visual channels, transforming raw data into a visual format that can be easily interpreted.
We will see later how we can incorporate visualization grammar with the Altair package to create visualizations, and thus perform exploratory data analysis.
The Dataset#
Many cardiologists believe that right heart catheterization (through the use of a monitoring device to measure cardiac function) is beneficial in helping manage patients in the ICU.
This RHC dataset was used in a significant study by Connors et al. (1996) published in the Journal of the American Medical Association. The study objective was to evaluate the effectiveness of RHC, also known as the Swan-Ganz catheter, in the initial care of critically ill patients. The data pertains to day 1 of hospitalization, focusing on whether or not a patient received RHC on the first day they qualified for the SUPPORT (The Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments).
Specifically, we aim to investigate potential associations between the use of Right Heart Catheterization (RHC) within the first 24 hours in the ICU and various health-related outcomes, such as length of stay.
Let’s first load in our data from the link https://hbiostat.org/data/repo/rhc.csv.
import pandas as pd
import numpy as np
ObsData = pd.read_csv("https://hbiostat.org/data/repo/rhc.csv").iloc[:,1:]
ObsData.head()
C:\Users\TESTUSER\AppData\Local\Temp\ipykernel_9532\2059020580.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
| cat1 | cat2 | ca | sadmdte | dschdte | dthdte | lstctdte | death | cardiohx | chfhx | ... | meta | hema | seps | trauma | ortho | adld3p | urin1 | race | income | ptid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | NaN | Yes | 11142 | 11151.0 | NaN | 11382 | No | 0 | 0 | ... | No | No | No | No | No | 0.0 | NaN | white | Under $11k | 5 |
| 1 | MOSF w/Sepsis | NaN | No | 11799 | 11844.0 | 11844.0 | 11844 | Yes | 1 | 1 | ... | No | No | Yes | No | No | NaN | 1437.0 | white | Under $11k | 7 |
| 2 | MOSF w/Malignancy | MOSF w/Sepsis | Yes | 12083 | 12143.0 | NaN | 12400 | No | 0 | 0 | ... | No | No | No | No | No | NaN | 599.0 | white | $25-$50k | 9 |
| 3 | ARF | NaN | No | 11146 | 11183.0 | 11183.0 | 11182 | Yes | 0 | 0 | ... | No | No | No | No | No | NaN | NaN | white | $11-$25k | 10 |
| 4 | MOSF w/Sepsis | NaN | No | 12035 | 12037.0 | 12037.0 | 12036 | Yes | 0 | 0 | ... | No | No | No | No | No | NaN | 64.0 | white | Under $11k | 11 |
5 rows × 62 columns
Creating the Analytic Data#
The outcome Y that we’re currently interested in, is the patient’s length of stay in the hospital.
To compute such a value, let’s take the date of discharge (date patient leaves hospital), and subtract it by the study admission date (date patient enters hospital).
# Creating length of stay
ObsData['Y'] = ObsData['dschdte'] - ObsData['sadmdte']
If the date of discharge is not available, we use date of death instead, subtracting it by study admission date.
missing_Y = ObsData['Y'].isna()
# Replace the missing values in 'Y' with the difference between 'dthdte' and 'sadmdte'
ObsData.loc[missing_Y, 'Y'] = ObsData['dthdte'][missing_Y] - ObsData['sadmdte'][missing_Y]
Now, let’s remove the outcomes we are not interested in for this example.
columns_to_drop = ['dthdte', 'lstctdte', 'dschdte', 'death', 't3d30', 'dth30', 'surv2md1']
ObsData = ObsData.drop(columns=columns_to_drop)
ObsData.head()
| cat1 | cat2 | ca | sadmdte | cardiohx | chfhx | dementhx | psychhx | chrpulhx | renalhx | ... | hema | seps | trauma | ortho | adld3p | urin1 | race | income | ptid | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | NaN | Yes | 11142 | 0 | 0 | 0 | 0 | 1 | 0 | ... | No | No | No | No | 0.0 | NaN | white | Under $11k | 5 | 9.0 |
| 1 | MOSF w/Sepsis | NaN | No | 11799 | 1 | 1 | 0 | 0 | 0 | 0 | ... | No | Yes | No | No | NaN | 1437.0 | white | Under $11k | 7 | 45.0 |
| 2 | MOSF w/Malignancy | MOSF w/Sepsis | Yes | 12083 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | 599.0 | white | $25-$50k | 9 | 60.0 |
| 3 | ARF | NaN | No | 11146 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | NaN | white | $11-$25k | 10 | 37.0 |
| 4 | MOSF w/Sepsis | NaN | No | 12035 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | NaN | 64.0 | white | Under $11k | 11 | 2.0 |
5 rows × 56 columns
Next, let’s remove variables deemed unnecessary and/or problematic.
columns_to_drop = ['sadmdte', 'ptid', 'adld3p', 'urin1', 'cat2']
ObsData = ObsData.drop(columns=columns_to_drop)
ObsData.head()
| cat1 | ca | cardiohx | chfhx | dementhx | psychhx | chrpulhx | renalhx | liverhx | gibledhx | ... | gastr | renal | meta | hema | seps | trauma | ortho | race | income | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COPD | Yes | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | Under $11k | 9.0 |
| 1 | MOSF w/Sepsis | No | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | Yes | No | No | white | Under $11k | 45.0 |
| 2 | MOSF w/Malignancy | Yes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | $25-$50k | 60.0 |
| 3 | ARF | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | $11-$25k | 37.0 |
| 4 | MOSF w/Sepsis | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | white | Under $11k | 2.0 |
5 rows × 51 columns
The next step is to convert all categorical variables to factors.
# List of columns to convert to categorical
factors = ["cat1", "ca", "cardiohx", "chfhx", "dementhx", "psychhx",
"chrpulhx", "renalhx", "liverhx", "gibledhx", "malighx",
"immunhx", "transhx", "amihx", "sex", "dnr1", "ninsclas",
"resp", "card", "neuro", "gastr", "renal", "meta", "hema",
"seps", "trauma", "ortho", "race", "income"]
# Convert specified columns to category dtype
ObsData[factors] = ObsData[factors].astype('category')
Next, let’s convert our treament variable “swang1” to a binary variable, A. As a reminder, “swang1” indicates whether or not a patient recieved a RHC on the first day.
ObsData['A'] = np.where(ObsData['swang1'] == 'RHC', 1, 0)
Now that binary treatment A has been created, we can drop the swang1 column, as we no longer need it.
# Drop the 'swang1' column
ObsData = ObsData.drop(columns=['swang1'])
Now, let’s categorize the variables so that they match with the original paper.
# Set the order of categories for 'race'
ObsData['race'] = pd.Categorical(ObsData['race'], categories=['white', 'black', 'other'], ordered=False)
# Convert 'sex' to a categorical type and set reference level
ObsData['sex'] = pd.Categorical(ObsData['sex'], categories=['Male', 'Female'], ordered=False)
# Convert 'cat1' to a categorical type and set new levels
ObsData['cat1'] = ObsData['cat1'].replace({
"Cirrhosis": "Other", "Colon Cancer": "Other", "Coma": "Other",
"COPD": "Other", "Lung Cancer": "Other", "MOSF w/Malignancy": "MOSF", "MOSF w/Sepsis": "MOSF"
}).astype('category')
# Convert 'ca' to a categorical type and set new levels
ObsData['ca'] = ObsData['ca'].replace({"No": "None", "Yes": "Localized (Yes)"}).astype('category')
C:\Users\TESTUSER\AppData\Local\Temp\ipykernel_9532\3298340898.py:8: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
ObsData['cat1'] = ObsData['cat1'].replace({
C:\Users\TESTUSER\AppData\Local\Temp\ipykernel_9532\3298340898.py:14: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
ObsData['ca'] = ObsData['ca'].replace({"No": "None", "Yes": "Localized (Yes)"}).astype('category')
Lastly, let’s rename the variables so that they’re easier to work with and understand.
# Renaming the columns
ObsData.columns = ["Disease.category", "Cancer", "Cardiovascular",
"Congestive.HF", "Dementia", "Psychiatric", "Pulmonary",
"Renal", "Hepatic", "GI.Bleed", "Tumor",
"Immunosuppression", "Transfer.hx", "MI", "age", "sex",
"edu", "DASIndex", "APACHE.score", "Glasgow.Coma.Score",
"blood.pressure", "WBC", "Heart.rate", "Respiratory.rate",
"Temperature", "PaO2vs.FIO2", "Albumin", "Hematocrit",
"Bilirubin", "Creatinine", "Sodium", "Potassium", "PaCo2",
"PH", "Weight", "DNR.status", "Medical.insurance",
"Respiratory.Diag", "Cardiovascular.Diag",
"Neurological.Diag", "Gastrointestinal.Diag", "Renal.Diag",
"Metabolic.Diag", "Hematologic.Diag", "Sepsis.Diag",
"Trauma.Diag", "Orthopedic.Diag", "race", "income",
"Y", "A"]
Recall that Y is length of stay (outcome), and A is RHC use (treatment)
Introduction to Altair#
We will be using the Python package “Altair” to create our visualizations,designed for simplicity and high-level plotting. It can be used to create a wide range of interactive and visually appealing charts with minimal code, and rather specifically focus on the relationships within the data.
Altair is particularly useful because it allows users to generate complex visualizations without worrying about low-level plotting details, making data analysis more accessible and efficient. Its integration with visualization grammars ensures that users can create sophisticated and interactive visualizations effortlessly.
To start working with Altair, let’s first load in the package.
pip install altair
Collecting altair
Downloading altair-5.4.0-py3-none-any.whl (671 kB)
------------------------------------- 671.7/671.7 kB 14.1 MB/s eta 0:00:00
Requirement already satisfied: jinja2 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (3.1.4)
Requirement already satisfied: jsonschema>=3.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (4.17.1)
Collecting narwhals>=1.1.0 (from altair)
Downloading narwhals-1.3.0-py3-none-any.whl (138 kB)
-------------------------------------- 138.3/138.3 kB 8.5 MB/s eta 0:00:00
Requirement already satisfied: packaging in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from altair) (23.2)
Collecting typing-extensions>=4.10.0 (from altair)
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Requirement already satisfied: attrs>=17.4.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jsonschema>=3.0->altair) (23.2.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jsonschema>=3.0->altair) (0.19.2)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages (from jinja2->altair) (2.1.1)
Installing collected packages: typing-extensions, narwhals, altair
Attempting uninstall: typing-extensions
Found existing installation: typing_extensions 4.9.0
Uninstalling typing_extensions-4.9.0:
Successfully uninstalled typing_extensions-4.9.0
Successfully installed altair-5.4.0 narwhals-1.3.0 typing-extensions-4.12.2
Note: you may need to restart the kernel to use updated packages.
WARNING: Ignoring invalid distribution -cipy (c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -cipy (c:\users\testuser\appdata\local\programs\python\python310\lib\site-packages)
[notice] A new release of pip is available: 23.1.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip
import altair as alt
Now, let’s examine the dimensions of our dataset.
ObsData.shape
(5735, 51)
It should also be noted that Altair only allows a maximum of 5000 rows/observations when creating visualizations. This is an issue, as our dataset has 5735 rows.
To address this problem, we can disable the MaxRows checker:
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
Now we are ready to proceed with our first visualization, where we aim to analyze the association between our treament (A) and outcome (Y) variables, which are RHC usage and length of stay, respectively.
Creating our First Visualization#
Let’s trace back to our understanding of visual grammars, and how that can be applied when creating plots in Altair.
For example, consider the following piece of Altair code:
alt.Chart(ObsData).mark_point().encode(x='A' , y='Y')
The above code block alt.Chart(ObsData).mark_point().encode(x='A', y='Y') can be described using the visualization grammar components as follows:
Creating a chart linked to a dataframe:
alt.Chart(ObsData)initializes a chart and links it to the dataframeObsData.
Adding graphical elements
.mark_point()specifies that the chart should use points to represent the data.
Encoding dataframe columns as visual channels:
.encode(x='A', y='Y')maps the dataframe column'A'to the x-axis and the column'Y'to the y-axis, defining how data is visually represented in the chart.
By incorporating the principles of visualization grammar, we can create clear and insightful visualizations. This structured approach enhances our ability to perform exploratory data analysis and communicate findings effectively.
The plot we have created compares how RHC is associated with the length of stay in the hospital. There does not seem to be any immediate indication that RHC has an effect on the length of stay. It appears that most patients typically stay less than 120 days.
Categorizing Visualizations by Colour#
What if we wanted to also map a categorical variable to the color scale?
We can utilize the ‘color’ argument of the .encode() function.
alt.Chart(ObsData).mark_point().encode(x='A' , y='Y', color='sex')
Now, we have taken the previous plot which looked at the association between RHC usage and length of stay and colored it by sex, male or female. Blue circles represent female, and orange represents male.
Let’s think about some conclusions that we can draw from this graph.
Distribution of Length of Stay: There is a wide range of hospital stay lengths for both groups (RHC and non-RHC). The lengths of stay vary greatly for both those who did and did not receive RHC, with no clear clustering of shorter or longer stays within each group.
Comparison Between RHC and Non-RHC Groups: Both groups (RHC = 0 and RHC = 1) show a similar spread in the length of hospital stay, indicating that RHC usage does not have an obvious impact on the length of stay based on this scatter plot alone.
Sex Differences: Both male and female patients are distributed across the entire range of hospital stays in both the RHC and non-RHC groups. There does not appear to be a significant difference in the pattern of hospital stays between males and females within each group (RHC and non-RHC).
Outliers: There are some outliers with very long hospital stays (e.g., around 350-400 days), but these are present in both RHC and non-RHC groups.
In summary, our scatterplot shows a broad and overlapping distribution of hospital stay lengths for both RHC users and non-users, with no clear pattern indicating that RHC usage significantly affects the length of hospital stay. Additionally, there are no noticeable differences in the length of stay between male and female patients within each RHC group. Just from the naked eye, it’s hard to tell which gender had more/less RHC usage, only that most of them typically had length of stays less than 120 days.
Side-by-Side Boxplots#
When looking at the visualization we created, it may be difficult to extract insights from looking strictly at two binary values, on an x-axis designed to be continuous. For this reason, we let’s instead take a look at side-by-side boxplots of the patient’s length of stay, separated by RHC usage.
How to Read Boxplots#
Boxplots are graphical representations of a dataset’s distribution, showing its central tendency, variability, and potential outliers. Each boxplot consists of a box, whiskers, and potentially individual points. The box spans from the first quartile (Q1) to the third quartile (Q3), with a line inside representing the median (Q2). The whiskers extend to the minimum and maximum values within 1.5 times the interquartile range (IQR) from the quartiles, while points outside this range are considered outliers and are plotted individually.
Note that we specifically indicated scale=alt.Scale(domain=[0, 400]) to ensure that all the numerical range along the y-axis remains consistent across both boxplots.
To create a side-by-side boxplot, we can simply create two individual boxplot objects, then concatenate them using Altair’s ‘|’ operator.
box1 = alt.Chart(ObsData[ObsData["A"]==0]).mark_boxplot().encode(
alt.Y("Y", scale=alt.Scale(domain=[0, 400]))
).properties(title="Boxplot of Y (No RHC Usage)")
box2 = alt.Chart(ObsData[ObsData["A"]==1]).mark_boxplot().encode(
alt.Y("Y")
).properties(title="Boxplot of Y (RHC Used)")
box1 | box2
From the above side-by-side boxplots, it appears that a great chunk of the data lies around 20 days or so, near the bottom of the boxpot.
There are many outliers, all mainly above the upper whisker, as displayed within our boxplots.
Morever, there does not seem to be a drastic difference between length of stay, and whether or not the patient underwent RHC usage.
If we wanted to increase the granularity of our plots further, we can break down each RHC usage category into genders, both male and female.
box1 = alt.Chart(ObsData[ObsData["A"]==0]).mark_boxplot().encode(
alt.Y("Y", scale=alt.Scale(domain=[0, 400]))
).properties().facet('sex', title="Boxplot of Y (No RHC Usage), Faceted by Sex")
box2 = alt.Chart(ObsData[ObsData["A"]==1]).mark_boxplot().encode(
alt.Y("Y")
).properties().facet('sex', title="Boxplot of Y (RHC Used), Faceted by Sex")
box1 | box2
It is evident from the boxplots created that the distinctions between each gender and RHC usage category is starting to vary more drastically.
This breakdown could be useful if we wished to conduct male-male or female-female comparisons, amongst varying RHC usage types.
Lastly, although not entirely recommended, if we wanted to stick with using a scatterplot to analyze potential association between RHC usage and length of stay, we can also colour our graphs based on sex, and view differences that way as well.
alt.Chart(ObsData).mark_boxplot().encode(
alt.Y("Y"),
alt.X("A"),
color='sex'
)
Age vs Length of Stay Scatterplot#
When it comes to scatterplots, it is typically most suitable when working with two quantitative variables.
Instead of looking at A (RHC usage) which is a binary variable, we can instead take a look at age, and potentially segregate our data further, by sex.
alt.Chart(ObsData).mark_point().encode(x='age' , y='Y')
The scatterplot shows the relationship between age and length of stay in the hospital (Y). From the plot, it is evident that most patients, regardless of age, have a length of stay under 100 days, with a dense concentration around 50 days or less. There are a few outliers composed of patients who have significantly longer stays, extending up to 350 days. Additionally, there is a noticeable spread in length of stay across all ages, suggesting no strong correlation between age and length of hospital stay, but rather a wide variability in stay durations for patients of all ages.
Module 2 Summary#
From this module, we learned:
The importance and purpose behind visualizing data.
The difference bewteen declarative and imperative visualizations.
Usage of visualization grammar in the context of Altair.
Creating scatterplots in Altair.
How to perform aggregations in-house.